Machine learning comes in two main varieties: supervised and unsupervised. In supervised learning, we have labelled training data:
\[(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)\]
Can we learn a function \(f\) mapping \(X\) to \(y\) by optimising its parameters \(\theta\) on the training data?
\[\begin{equation} H_i = a + b S_i + \epsilon_i \end{equation}\]
where \(H_i\) is the house price; \(S_i\) is the size in square feet; and \(\epsilon_i\) is an error term.
Choose the model parameters to minimise the mean squared error, that is:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} \epsilon_i^2 \end{equation}\]
which is the same as choosing values of \(a\) and \(b\) to minimise:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i))^2 \end{equation}\]
Least absolute deviation loss:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} |\epsilon_i| \end{equation}\]
Quartic power loss:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} \epsilon_i^4 \end{equation}\]
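As a sketch, the three losses above can be computed on hypothetical toy data (the sizes and prices here are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: house sizes (sq ft) and prices (thousands)
S = np.array([1000.0, 1500.0, 2000.0, 2500.0])
H = np.array([200.0, 280.0, 370.0, 450.0])

def losses(a, b):
    """Compute all three losses for the linear model H = a + b*S."""
    eps = H - (a + b * S)          # residuals
    mse = np.mean(eps ** 2)        # least-squares loss
    lad = np.mean(np.abs(eps))     # least absolute deviation loss
    quartic = np.mean(eps ** 4)    # quartic power loss
    return mse, lad, quartic
```

Note that the quartic loss penalises large residuals far more heavily than the least absolute deviation loss, so the three losses generally favour different parameter values.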
Suppose we want to minimise a least-squares loss function:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i))^2 \end{equation}\]
Choose \(a\) and \(b\) to minimise this loss \(\implies\) differentiate!
We determine the estimates \(\hat{a}\) and \(\hat{b}\) as the values minimising \(L\):
\[\begin{align} \frac{\partial L}{\partial a} &= -\frac{2}{K}\sum_{i=1}^{K} (H_i - (a + b S_i)) = 0\\ \frac{\partial L}{\partial b} &= -\frac{2}{K}\sum_{i=1}^{K} S_i (H_i - (a + b S_i)) = 0 \end{align}\]
For general loss functions, no closed-form solution exists. That is, usually equations like:
\[\begin{align} -\frac{2}{K}\sum_{i=1}^{K} (H_i - (a + b S_i)) &= 0\\ -\frac{2}{K}\sum_{i=1}^{K} S_i (H_i - (a + b S_i)) &= 0 \end{align}\]
cannot be solved analytically. (Here, they actually can.)
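In this least-squares case, solving the two first-order conditions gives the familiar closed-form estimates \(\hat{b} = \text{Cov}(S, H)/\text{Var}(S)\) and \(\hat{a} = \bar{H} - \hat{b}\bar{S}\). A sketch, on hypothetical noise-free toy data chosen so that \(H_i = 50 + 0.1 S_i\) exactly:

```python
import numpy as np

# Hypothetical toy data satisfying H = 50 + 0.1*S exactly
S = np.array([1000.0, 1500.0, 2000.0, 2500.0])
H = np.array([150.0, 200.0, 250.0, 300.0])

# Closed-form solution of the two first-order conditions
b_hat = np.cov(S, H, bias=True)[0, 1] / np.var(S)  # b = Cov(S,H)/Var(S)
a_hat = H.mean() - b_hat * S.mean()                # a = mean(H) - b*mean(S)
```

With noise-free data the fit recovers the generating coefficients exactly.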
Instead of solving the equations directly, we use gradient descent: repeatedly apply the updates
\[\begin{align} a &= a - \eta \frac{\partial L}{\partial a}\\ b &= b - \eta \frac{\partial L}{\partial b} \end{align}\]
until \(a\) and \(b\) no longer change, where \(\eta\) is the learning rate.
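A minimal sketch of this update loop, using hypothetical noise-free toy data (sizes rescaled to thousands of square feet so that a single learning rate works for both parameters):

```python
import numpy as np

# Hypothetical toy data: sizes in thousands of sq ft, prices in thousands
S = np.array([1.0, 1.5, 2.0, 2.5])
H = 50.0 + 100.0 * S           # generated with a = 50, b = 100, no noise

a, b = 0.0, 0.0
eta = 0.1                      # learning rate
for _ in range(20000):
    resid = H - (a + b * S)
    a -= eta * (-2.0 * resid.mean())        # a <- a - eta * dL/da
    b -= eta * (-2.0 * (S * resid).mean())  # b <- b - eta * dL/db
```

If \(\eta\) is too large the iterates diverge; too small and convergence is very slow, which is why rescaling the inputs helps in practice.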
\[\begin{equation} H_i = a + b S_i + c S_i^2 + \epsilon_i \end{equation}\]
What does this model look like?
Least-squares loss function:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i + c S_i^2))^2 \end{equation}\]
\[\begin{align} a &= a - \eta \frac{\partial L}{\partial a}\\ b &= b - \eta \frac{\partial L}{\partial b}\\ c &= c - \eta \frac{\partial L}{\partial c} \end{align}\]
\[\begin{equation} H_i = a + b S_i + c S_i^2 + d S_i^3 + ... + \epsilon_i \end{equation}\]
Hold out a separate validation set on which to test model predictions.
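A sketch of this idea on hypothetical simulated data: fit polynomials of increasing degree on the training portion only, then score each on the held-out validation points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: quadratic truth plus noise
S = rng.uniform(1.0, 3.0, size=100)
H = 50.0 + 100.0 * S + 10.0 * S**2 + rng.normal(0.0, 5.0, size=100)

# Hold out 20 of the 100 points as a validation set
idx = rng.permutation(100)
train, valid = idx[:80], idx[80:]

# Fit on the training data only; score on the held-out points
val_mse = {}
for degree in (1, 2, 3, 5):
    coeffs = np.polyfit(S[train], H[train], degree)
    pred = np.polyval(coeffs, S[valid])
    val_mse[degree] = np.mean((H[valid] - pred) ** 2)
```

Validation error typically falls as the degree approaches that of the true model and then worsens as higher-degree fits start chasing noise in the training set.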
For a binary outcome \(y_i \in \{0, 1\}\), linear regression is not appropriate; instead we model:
\[\begin{equation} y_i \sim \text{Bernoulli}(\theta_i) \end{equation}\]
where \(\theta_i = \text{Pr}(y_i=1)\) with \(0\leq \theta_i \leq 1\).
The probability mass function is given by:
\[\begin{equation} \text{Pr}(y_i|\theta_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i} \end{equation}\]
so that \(\text{Pr}(y_i=1) = \theta_i\) and \(\text{Pr}(y_i=0) = 1 - \theta_i\)
In logistic regression, we use the logistic function:
\[\begin{equation} \theta_i = f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_i))} \end{equation}\]
Assuming the data are i.i.d., the likelihood is:
\[\begin{equation} L=p(\boldsymbol{y}|\beta,\boldsymbol{x}) = \prod_{i=1}^{K} f_\beta(x_i)^{y_i} (1 - f_\beta(x_i))^{1 - y_i}. \end{equation}\]
We can use gradient descent on the negative log-likelihood to find maximum likelihood estimates (or estimate \(\beta\) using Bayesian inference).
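A sketch of maximum likelihood estimation by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative), on hypothetical data simulated with known coefficients:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical simulated data: beta_0 = -0.5, beta_1 = 2.0
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)
theta_true = logistic(-0.5 + 2.0 * x)
y = (rng.random(500) < theta_true).astype(float)   # Bernoulli draws

# Gradient of the mean log-likelihood:
#   d/d b0 : mean(y - theta),   d/d b1 : mean(x * (y - theta))
b0, b1 = 0.0, 0.0
eta = 0.5
for _ in range(5000):
    theta = logistic(b0 + b1 * x)
    b0 += eta * np.mean(y - theta)
    b1 += eta * np.mean(x * (y - theta))
```

With 500 observations the estimates land close to, but not exactly at, the generating coefficients, reflecting sampling variability.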
It is straightforward to extend the model to incorporate multiple predictors:
\[\begin{equation} f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))} \end{equation}\]
But how to interpret parameters of logistic regression?
Another way of writing the logistic function:
\[\begin{align} f_\beta(x_i) &= \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))}\\ &= \frac{\exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
so that
\[\begin{align} 1 - f_\beta(x_i) = \frac{1}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
Taking the ratio gives the odds:
\[\begin{equation} \text{odds} = \frac{f_\beta(x_i)}{1-f_\beta(x_i)} = \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}) \end{equation}\]
so that
\[\begin{equation} \log\text{odds} =\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i} \end{equation}\]
meaning that (say) \(\beta_1\) represents the change in log-odds for a one-unit change in \(x_{1}\), holding the other predictors fixed.
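Equivalently, a one-unit increase in \(x_1\) multiplies the odds by \(\exp(\beta_1)\). A quick numerical check, with hypothetical coefficient values:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.7             # hypothetical coefficients

def odds(x):
    p = logistic(b0 + b1 * x)
    return p / (1.0 - p)

# The odds ratio for a one-unit increase equals exp(b1),
# regardless of the starting value of x
ratio = odds(2.0) / odds(1.0)
```

This multiplicative interpretation is why logistic regression coefficients are usually reported as odds ratios \(\exp(\beta_j)\).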
All available on SOLO:
Coursera: